Incorporating visual information for spoken term detection

Authors

  • Shahram Kalantari
  • David Dean
  • Sridha Sridharan
Abstract

Spoken term detection (STD) is the task of looking up a spoken term in a large volume of speech segments. To enable fast search, the speech segments are first indexed into an intermediate representation by speech recognition engines that provide multiple hypotheses for each segment. Approximate matching techniques are usually applied at the search stage to compensate for the poor performance of automatic speech recognition engines during indexing. Recently, using visual information in addition to audio information has been shown to improve phone recognition performance, particularly in noisy environments. In this paper, we make use of visual information, in the form of the speaker's lip movements, at the indexing stage and investigate its effect on STD performance. In particular, we investigate whether gains in phone recognition accuracy carry through the approximate matching stage to provide similar gains in the final audio-visual STD system over a traditional audio-only approach. We also investigate the effect of using visual information on STD performance in different noise environments.
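The approximate matching stage described above can be illustrated with a minimal sketch: the indexed transcript and the query are treated as phone sequences, and a hit is reported wherever a window of the transcript matches the query within a small edit-distance budget. The phone strings, the window scheme, and the threshold below are illustrative assumptions, not details taken from the paper.

```python
# Minimal sketch of approximate matching for spoken term detection.
# The transcript is a phone-level index produced by a recognizer; the
# query is the phone sequence of the search term. Recognition errors
# are absorbed by allowing a bounded Levenshtein distance per window.

def edit_distance(a, b):
    """Levenshtein distance between two phone sequences."""
    prev = list(range(len(b) + 1))
    for i, pa in enumerate(a, start=1):
        cur = [i]
        for j, pb in enumerate(b, start=1):
            cost = 0 if pa == pb else 1
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + cost))  # substitution
        prev = cur
    return prev[-1]

def approximate_search(transcript, query, max_dist=1):
    """Return start indices where the query matches within max_dist edits."""
    n = len(query)
    hits = []
    for start in range(len(transcript) - n + 1):
        if edit_distance(transcript[start:start + n], query) <= max_dist:
            hits.append(start)
    return hits

# A phone-level transcript where the recognizer produced "d" for "t":
transcript = ["sil", "h", "eh", "l", "ow", "w", "er", "l", "d", "sil"]
query = ["w", "er", "l", "t"]
print(approximate_search(transcript, query, max_dist=1))  # -> [5]
```

An exact-match lookup would miss this occurrence because of the single phone error; the edit-distance budget is what recovers it, at the cost of some false alarms as the budget grows.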


Similar resources

Exploring the Incorporation of Acoustic Information into Term Weights for Spoken Document Retrieval

Standard term weighting methods derived from experience with text collections have been used successfully in various spoken document retrieval evaluations. However, the speech recognition techniques used to index the contents of spoken documents are errorful, and these mistakes are propagated into the document index file, degrading retrieval performance. It has been suggested t...


Can You Repeat That? Using Word Repetition to Improve Spoken Term Detection

We aim to improve spoken term detection performance by incorporating contextual information beyond traditional N-gram language models. Instead of taking a broad view of topic context in spoken documents, variability of word co-occurrence statistics across corpora leads us to focus instead on the phenomenon of word repetition within single documents. We show that given the detection of one instan...


"Look at this!" learning to guide visual saliency in human-robot interaction

We learn to direct computational visual attention in multimodal (i.e., pointing gestures and spoken references) human-robot interaction. For this purpose, we train a conditional random field to integrate features that reflect low-level visual saliency, the likelihood of salient objects, the probability that a given pixel is pointed at, and – if available – spoken information about the target ob...


On the Concept of Correct Hits in Spoken Term Detection

In most Information Retrieval (IR) tasks the aim is to find human-comprehensible items of information in large archives. One such task is spoken term detection (STD), where we look for user-entered keywords in a large audio database. To evaluate the performance of a spoken term detection system we have to know the real occurrences of the keywords entered. Although there are standard auto...


Improved Speech Summarization and Spoken Term Detection with Graphical Analysis of Utterance Similarities

We present summarization and spoken term detection (STD) approaches that take into account similarities between utterances to be scored for summary extraction or ranking in STD. A graph is constructed in which each utterance is a node. Similar utterances are connected by edges, with the edge weights representing the degree of similarity. The similarity for summarization is topical similarity; t...



Publication date: 2015